Constraint-based Local Move Definitions for 
Lattice Protein Models Including Side Chains 



Martin Mann 1 , Mohamed Abou Hamra 2 , Kathleen Stcinhofcl 3 , and Rolf 

Backofcn 1 

1 University of Freiburg, Bioinformatics, 79110 Freiburg, Germany, 
{mmann.backof en}@inf ormatik .uni-f reiburg . de, 
2 German University in Cairo, 5th Settlement New Cairo City, Cairo, Egypt, 
mohamed. abdel-f attah@student .guc . edu. eg, 
3 King's College London, Department of Computer Science, London WC2R 2LS, UK, 
kathleen. steinhof el@kcl . ac .uk, 



Abstract. The simulation of a protein's folding process is often done 
via stochastic local search, which requires a procedure to apply structural 
changes onto a given conformation. Here, we introduce a constraint-based 
approach to enumerate lattice protein structures according to fc-local 
moves in arbitrary lattices. Our declarative description is much more 
flexible for extensions than standard operational formulations. It enables 
a generic calculation of fc-local neighbors in backbone-only and side chain 
models. We exemplify the procedure using a simple hierarchical folding 
scheme. 

1 Introduction 

The in silico determination of a protein's functional fold is a well established 
problem in bioinformatics. Since X-ray or NMR studies are still time consuming 
and expensive, computational methods for ab initio protein structure predic- 
tion are needed. Despite research over the last decades, a direct calculation of 
minimal energy structures in full atom resolution is currently not feasible. Thus, 
heuristics and a wide variation of protein models have been developed to identify 
fundamental principles guiding the process of structure formation. A common 
abstraction of proteins are lattice protein models [3, 13, 14, 20]. Their discretized 
structure space enables efficient folding simulations [29, 31] while maintaining 
good modelling accuracy [25]. 

Folding simulations are often based on stochastic local searches, e.g. Monte 
Carlo simulations [29]. Different procedures, so called move sets, have been devel- 
oped to calculate the structural changes along the simulation, i.e. to enumerate 
the structural neighborhood of a certain structure. A method often applied in 
literature are k-local moves [28] that allow for structural changes within a suc- 
cessive interval of fixed length k. They are discussed in detail in Sec 3. Dotu 
and co-workers have used local moves for backbone-only HP models within a 
constraint-based large neighborhood search for optimal protein structures [9]. 
Lesh et al. introduced pull moves [15] that are widely used in recent studies [19, 



29] . Pivot moves allow for the rotation or reflection of subchains at an arbitrary 
Pivot position of the structure [17], while Zhang et al. suggested a sequential 
regrowth of structure fragments to enhance folding simulations [31]. 

All named move sets are currently restricted to backbone-only lattice pro- 
tein models, i.e. only the C a -trail of the protein is modeled. For more realistic 
protein models incorporating side chains, often a combination of different move 
sets is applied. Betancourt combined Pivot moves on the backbone with a new 
FEM move set [5], while Dima and Thirumalai have used a combination of 2- 
local moves on the backbone with a simple relocation of the side chain [8] . An 
exception is the advanced CABS model by Kolinski and co-workers [13], which 
represents the side chain in higher detail and requires more complex moves. 

Here, we introduce a generic and flexible approach to enable folding sim- 
ulations in backbone-only and side chain models using any fc-local moves (i.e. 
any interval length k) in arbitrary lattices. The constraint programming (CP) 
based formulation focuses on a description of the targeted structural neighbors 
instead of an operational encoding of the moves possible. The introduced scheme 
is therefore easy to extend with new directives or can be used for other applica- 
tions, e.g. fragment re-localization [31], as discussed later. Beneath applications 
in studies of the whole energy landscape [21], the approach is well placed to 
be applied within a local search following the framework of Pesant and Gen- 
dreau [26]. We apply our move set for side chain models within a simple folding 
simulation procedure in the style of [29] and evaluate the results with known 
protein structures. 

2 Preliminaries 

Given a lattice L C Z 3 and an according neighborhood relation ~ between co- 
ordinates of L. A backbone- only lattice protein of length n is described by (S, C) 
where S G E n denotes the sequence over some alphabet £ (e.g. the 20 proteino- 
gen amino acids) and C G L n the lattice nodes occupied. A valid lattice pro- 
tein structure satisfies connectivity of successive monomers Vi<j<„ : C, ~ Cj + i 
and their sclf-avoidingncss Vi<j<j<« : C; ^ Cj . A side chain lattice protein is 
defined by (S,C h ,C S ), i.e. a sequence S G S n , the backbone positions C b G 
L n and the side chain positions C s G L n . The side chain position C s repre- 
sents the centroid of the amino acid's side chain atoms. A valid lattice pro- 
tein structure including side chains satisfies connectivity of successive backbone 
monomers Vi<i<„ : Cf ~ Cf +1 , the connection of backbone and side chain for 
each amino acid Vi<i<„ : ~ Cf, and the selfavoidingncss of all monomers 
Vi<i<y<„ : C\ ^ Cf/\C? + C) A Cf ^ C| A Cf ^ Cf . We consider the contact 

based energy functions E h (S,C) - Ei5<S« 

e(Si,Sj) for backbone-only and 

E B (S, C b , C s ) — J2i<i<j< n e 0%j f° r side chain lattice proteins for a given 
energy contribution function e : S x S — > R. Note, the energy function for side 
chain proteins considers (as in [7]) the contacts between side chain positions 



Fig. 1. HP-optimal structures of HPPHHPPPHPHHPHHPPHPHPPHHHPHHPPHPHPH in the face- 
centered-cubic lattice, (a) backbone-only model with energy -50, (b) side chain model 
with energy -55. Colors: green - H monomers, gray - P monomers, red - backbone in 
side chain models. Calculation and visualization are done using the CPSP-package [22]. 

only! e denotes an empirical 20 amino acid contact potential as described 
in [4,23]. e HP represents the energy contribution function of the Hydrophobic- 
Polar (HP) model [14], i.e. it returns —1 if both amino acids are hydrophobic, 
otherwise. Our hydrophobic/polar (H/P) assignment follows [29]. An optimal 
structure minimizes the energy function. In the following, we denote a structure 
HP- optimal if it minimizes the energy function based on e HP . Figure 1 exem- 
plifies HP-optimal structures for both lattice protein models. In the following, 
we assume a scaled lattice such that neighbored positions in the lattice have a 
distance of 3.8A, the average C Q -atom distance in proteins. 

3 Constraint-based Local Move Set Definition 

To enable folding simulations we need a definition of structural changes that 
encodes the structural neighborhood of a given lattice protein structure. Here, 
we follow the idea of fc-local moves, that confine the difference between the initial 
and the neighbored structure to a consecutive interval of maximal length of fc. 
Therefore, we define the fc- neighborhood A4(C) of a given structure C as: 

■A4(C) = {valid structures C | 3i< s <„ : V^ [Si ... i ( s+fe _ 1 ) ] : G, = Cj} (1) 

In order to enumerate all valid structural neighbors C' £ A4(C) of a given 
lattice protein C, we have to enumerate the neighbors for all possible interval 
lengths 1 < k' < k and interval starts 1 < s < (n — k' + 1). Since we want to 
calculate each neighboring structure only once, we have to enhance the fc-local 
move definition to strict k-local moves. Here, we enforce in addition that both 
ends (C' s and C' s+k _ 1 ) of the successive interval of length k are changed, i.e. 
a strict fc-local move does not cover a fc'-local move with k' < fc, as a normal 
fc-local move in accordance with Eq. 1 docs. This ensures a unique enumeration 
of structural neighbors for an increasing k' . 

In the following, we will introduce the Constraint Satisfaction Problems 
(CSP) that describe all valid structural neighbors C' G A4(C) of a given lattice 
protein C according to strict fc-local moves in a lattice L. A CSP is given by 



(X, V, C), where we denote the set of variables X, their domains T>, and a set of 
constraints C. A solution of a CSP is an assignment ai S T>(Xi) for each vari- 
able that satisfies all constraints in C. To simplify the presentation, we utilize 
a binary neighboring constraint neigh(X, Y) that ensures Vd x eT>(X) '■ eT>(Y) '■ 

(d x ~ d y ) and vice versa. Furthermore, we use the global all-different constraint 
by Rcgin [27] to enforce pairwise differences within a set of variables. 

3.1 CSP for Backbone-only Models 

Given a valid backbone-only lattice protein structure C of length n, a move 
interval length k < n, and the start of the interval 1 < s < (n — k + 1). 
We define k variables Xi, one for each position of the interval, with T>(Xi) = 
L\ {Ci, . . . , C s _i, C s +fc, . . . , C n }. These variables have to form a valid structure, 
therefore we post all-different (X\, . . . ,Xk) and Vi<,<fc : neigh(Xi, Xj+i). Since 
we describe a substructure, it has to be connected to the interval borders: if 
s > 1 : neigh {X\, C s _i) and if (s + k — 1) < n : neigh(Xfc, C s+ fc). Finally we 
enforce that both ends of the interval are different from the old placement, i.e. 
X\ ^ Ci and Xk ^ Ci+k-i, to enumerate strict fc-local move neighbors only. 

The presented CSP is similar to the work of Dotu et al. [9] , but in contrast 
ensures the uniqueness of each move. Thus, each neighbored structure is available 
only via a single interval. This is of high importance to enable a non-redundant 
enumeration of a structure's neighborhood in the fold space to access its energy 
landscape [21]. 

3.2 CSP for Models Including Side Chains 

Given a valid side chain lattice protein structure (C b ,C s ) of length n, a move 
interval length k < n, and the start of the interval 1 < i < (n — k + 1). We define 
k variables X"° and X*, two for each position of the interval, with V(X^) = 
^(^f ) = L \ {&1 j * * * ) C s -i> Cg+fci • ■ ■ i C«> Cfi • ■ ■ i Cf-ii Cg+fej • ■ ■ j Cnl- To en- 
sure a valid structure, we enforce all-different (X^, . . . , X%, Xf, . . . , X^), Vi<i<fc : 
ncigh(X 4 b , Xf +1 ), and Vi<i<fc : neigh (X 4 b , Xf). Since we describe a substruc- 
ture, it has to be connected to the interval borders: if s > 1 : ncigh(X{ > , C^-J 
and if (s + k — 1) < n : neigh(Xj?, C b +fe ). Finally, we warrant the strictness 
of the fc-local moves and enforce that both ends of the interval differ from 
the old backbone or side chain placement, i.e. (X\ ^ C 4 b V X\ ^ Cf) and 

4 Application 

In the following, we applied the introduced move set to folding simulations of 
side chain lattice protein models in the 3D face-centered-cubic (FCC) lattice. In 
the FCC lattice, two lattice points l\ and li are neighbored, if and only if 

(Ji-k)e {±(1,1,0), ±(i,o,i), ±(o,i,i), ±(i,-i,o), ±(i,o,-i), ±(o,i,-i)}. 

Thus, each point of the FCC lattice has 12 neighbored positions, fc-local moves 



are known to be non-ergodic for backbone-only models [16] depending on k, the 
used lattice, and the protein length. We expect the same for models including 
side chains, but using the FCC and an intermediate k should shift the problem 
to long chain lengths. Thus, wc apply 3-local moves, i.e. with a maximal interval 
length k = 3 such that up to 6 monomers are moved (2 per amino acid). The 
implementation is based on Gecode [11]. To evaluate the structural difference 

between two structures (C b ,C s ) and (C b ,C s ) we calculate the distance and 
coordinate root mean square deviation (dRMSD and cRMSD) as given by Eq. 2 
and 3, respectively. The needed superpositioning utilizes Kabsch's algorithm [12]. 
We apply the contact based energy function E s that evaluates (only) side chain 
monomer contacts using the e 20 contact energy potentials from Sec. 2 similar to 
the backbone-only studies in [4, 29]. In the following, we use C as an abbreviation 



We derived a protein data from the Pisces web server [30] on June 23rd 2009. 
Only complete X-ray structures of 2. OA resolution or better with an R- value 
of 0.3 that contain side-chain data were considered. We used a 30% sequence 
identity cut-off. Since we are applying a simple contact-based energy function 
we filtered for short globular shaped proteins. Table 1 summarizes the used 
sequences and their corresponding Protein Data Bank (PDB) identifiers etc. 

For each full atom PDB structure Cpdb, we derived a lattice protein struc- 
ture Cfu that minimizes the dRMSD to Cpdb- This was done using LatFit 
from the LATPACK-tools package vl.7.0 4 [19]. Table 1 summarizes the resulting 
dRMSD and cRMSD values. 

Since the applied energy function is still a rough abstraction of the forces 
that guide the real folding process into Cpdb, no energy minimizing folding 
strategy will find the fitted lattice protein structure Cfu - Thus, we map Cfu to 
the according local minimum in the energy landscape. The mapping is done via 
a steepest decent or gradient walk. Starting from a given structure, at each step 
the neighbored structure with lowest energy is chosen for the next step until no 
such neighbor exists. Therefore, a gradient walk ends in a local minimum of the 
energy landscape, which we denote g(C) for a given start structure C. 

The g(Cfu) structures represent our "true" model to benchmark the fol- 
lowing folding scheme. The energies of Cfu and g(Cfu) and their structural 
differences to each other and to Cpdb are given in Table 1. 

The folding simulation procedure applied follows the idea of [29]. For each 
amino acid sequence S, we derive an according HP-sequence Shp using the 
translation table used in [29]. The derived Shp are given in Table 1. Follow- 
ing the observation of the hydrophobic collapse [1], we calculated HP-optimal 

4 Freely available at http://www.bioinf.uni-freiburg.de/Software/LatPack/ 



for (C 





E^CI^J 5 - cf\ - \c\ - c\|) 2 + (\a - q\ - fa - 1>,|) 2 

+ E,(|C i b -C7f|-|C b l -C7M) 2 



PDB ID - chain 


Sequences (original and HP transform) 


1BAZ-A 


SKMPQVNLRWPREVLDLVRKVABENGRSVNSEIYQRVMESFKKEGRIGA 
PPHPPHPHPHPPPHHPHHPPHPPPPPPPHPPPHHPPHHPPHPPPPPHPP 


1J8E-A 


GSHSCSSTQFKCNSGRCIPEHWTCDGDNDCGDYSDETHANCTNQ 
PPPPHPPPPHPHPPPPHHPPPHPHPPPPPHPPHPPPPPPPHPPP 


1RH6-A 


MYLTLQEWNARQRRPRSLETVRRWVRESRIFPPPVKDGREYLFHESAVKVDLNRP 
HHHPHPPHPPPPPPPPPHPPHPPHHPPPPHHPPPHPPPPPHHHPPPPHPHPHPPP 


1Z0J-B 


IEEELLLQQIDNIKAYIFDAKQCGRLDEVEVLTENLRELKHTLAKQKGGTD 
HPPPHHHPPHPPHPPHHHPPPPHPPHPPHPHHPPPHPPHPPPHPPPPPPPP 


2DS5-A 


GKLLYCSFCGKSQHEVRKLIAGPSVYICDECVDLCNDIIREEI 
PPHHHHPHHPPPPPPHPPHHPPPPHHHHPPHHPHHPPHHPPPH 


2EQ7-C 


LAMPAAERLMQEKGV3PAEVQGTGLGGRILKEDVMRH 
HPHPPPPPHHPPPPHPPPPHPPPPHPPPHHPPPHHPP 


2HBA-A 


MKVIFLKDVKGMGKKGEIKNVADGYANNFLFKQGLAIEATPANLKALEAQKQ 
HPHHHHPPHPPHPPPPPHPPHPPPHPPPHHHPPPHPHPPPPPPHPPHPPPPP 



PDB ID 




C/it tO CpDB 








to Cfu 


- chain 


n 


dRMSD 


cRMSD 


E(C, it ) 




dRMSD 


cRMSD 


1BAZ-A 


49 


0.886 A 


1.725 A 


-3.73 


-31.51 


4.050 A 


6.565 A 


1J8E-A 


44 


0.928 A 


1.939 A 


-3.54 


-30.76 


3.865 A 


6.857 A 


1RH6-A 


55 


0.921 A 


1.791 A 


1.33 


-38.17 


4.192 A 


8.243 A 


1Z0J-B 


51 


0.917 A 


2.095 A 


2.05 


-35.95 


3.185 A 


6.640 A 


2DS5-A 


43 


0.901 A 


1.750 A 


-4.35 


-34.36 


4.658 A 


7.755 A 


2EQ7-C 


37 


0.905 A 


1.813 A 


-3.07 


-20.58 


2.328 A 


4.751 A 


2HBA-A 


52 


0.890 A 


1.780 A 


-3.04 


-30.62 


3.224 A 


6.015 A 



Table 1. Used sequences, their HP transforms, length n, the quality of the fitted lattice 
protein model, and the according energies. 

structure representatives utilizing the CPSP-approach [3, 22, 20] and its latest 
extension HPrep [18]. The resulting HP-optimal structures are named Chp- 
For each Chp we run gradient walks and evaluated the resulting local minima 
found. The corresponding energies arc listed in Table 2. Furthermore, we per- 
formed a structural comparison of the resulting g(Cup) structures to our "true" 
models g(Cfu) from the fitting. The RMSD values are given in Table 2. 

In addition, we executed for each Cjj p random descending walks in order to 
sample the local minima of the energy landscape accessible from the collapsed 
starting structures. Here, at each step a random neighbor with lower energy 
is selected following a uniform distribution until no such neighbor exists. The 
lowest reached local minimum of all random descending walks starting at C is 
denoted by r{C). Energy and structural differences are given in Table 2. 

5 Discussion 

The gradient walks using 3-local moves starting from the fitted structures Cfu 
revealed that the currently applied contact based energy function using the en- 
ergy potentials e 20 , originally derived for backbone-only models [4], docs not 



PDB ID 


average values 


minimal values 


- chain 


(E(C H p)} 


min E(g(C H p)) 


mmE{r(C H p)) 


1BAZ-A 


-10.67 


-33.07 


-34.60 


1J8E-A 


-12.45 


-29.33 


-32.35 


1RH6-A 


-13.09 


-35.12 


-37.59 


1Z0J-B 


-13.42 


-34.71 


-37.69 


2DS5-A 


-6.97 


-31.00 


-32.53 


2EQ7-C 


-6.55 


-21.64 


-25.10 


2HBA-A 


-11.07 


-30.91 


-35.56 



PDB ID 


g(C HP ) 


vs. g(Cfit) 


r(C HP ) 


vs. g(C fu ) 


- chain 


dRMSD 


cRMSD 


dRMSD 


cRMSD 


1BAZ-A 


4.736 A 


8.797 A 


4.762 A 


9.360 A 


1J8E-A 


3.384 A 


7.508 A 


3.196 A 


7.052 A 


1RH6-A 


4.190 A 


9.645 A 


4.242 A 


10.156 A 


1Z0J-B 


5.609 A 


10.166 A 


6.232 A 


11.438 A 


2DS5-A 


3.588 A 


8.679 A 


3.425 A 


7.639 i 


2EQ7-C 


3.427 A 


7.247 A 


4.177 A 


8.401 i 


2HBA-A 


3.832 A 


8.848 A 


4.194 A 


9.075 A 



Table 2. Resulting energies and a structural comparison of the folding results. 

reflect the real forces present for models including side chains. This can be ob- 
served when comparing the energies E(Cfu) to E(g(C 'fa)) (see Table 1). An 
energy function that results in a smaller difference would be preferable, i.e. it 
would be a better model for the real forces guiding the folding process to Cpdb- 
In addition we could show, that the derived structures from our simple energy- 
optimizing folding simulation procedure are still quite dissimilar to the energy- 
optimized lattice fits of the real structures (see Table 2). We assume this mainly 
results from the simple energy function as well. 

To improve the results, we plan to apply more advanced energy functions, e.g. 
following [13]. Most important: the energy function has to consider the backbone 
positioning as well, which is not done by the contact-based energy functions from 
Sec. 2. Additionally, we want to apply distance based energy potentials that allow 
for a more realistic energy evaluation. Another direction of ongoing research is to 
further constrain the allowed structures. Here, we will directly benefit from the 
CP-based formulation of fc-local moves. Since we are formulating a CSP on valid 
structural neighbors, it is quite easy to post additional structural constraints. 
For instance, we can enforce a restriction on the allowed relative torsion angles 
along the protein chain (as e.g. done in [24]), that follows the observation of a 
limited degree of freedom in nature. 

6 Conclusions and Summary 

We introduced a CP-based approach to enumerate fc-local neighbors of a lattice 
protein structure in backbone-only and side chain lattice protein models. The 
generic approach can be applied for any local move length k within arbitrary 



lattices. Thus, it enables a fast prototyping of new folding simulation schemes 
or can be easily extended with additional constraints, e.g. restricted torsion an- 
gles. The CSP formulation enables the enumeration of the whole fc-local move 
neighborhood A4 (C) of a given structure C or the calculation of a random neigh- 
boring structure C r £ A4 (C) when applying a randomized search as possible in 
Gccode [11]. The application of symmetry breaking search [2] can be used to 
avoid the enumeration of symmetric structures, increasing the efficiency of fold- 
ing simulations [10]. We plan the incorporation of the A:-local move neighbor 
enumeration into our CH — h energy landscape library (ELL) [21]. This will open 
an easy interface for folding simulations in arbitrary lattices utilizing any energy 
function of interest. Furthermore, this will enable full kinetics studies based on 
the energy landscape topology. 

We will utilize the flexibility of the CP-based approach to incorporate ad- 
ditional structural constraints into the neighborhood generation. Following [24, 
23], it is beneficial to restrict torsion angles along the backbone or to exploit 
secondary structure information. 

Another advantage of the CP-based approach is its extensibility to constraint 
optimization problems (COP). Currently, we plan to incorporate the energy func- 
tion as the objective into the CSP, as e.g. done in [6, 9]. Thus, by solving a COP 
while optimizing the energy function, we can directly calculate the lowest energy 
neighbor of a structure following the framework of Pesant and Gendreau [26], 
which is needed e.g. for a gradient walk in the energy landscape as done in Sec. 4. 
Furthermore, this would enable an extension of the work of Zhang et al. [31]. 
They showed (for backbone-only models) that the performance of Monte Carlo 
folding simulations can be significantly increased using a greedy sequential re- 
growth of subchains. Thus, we plan to directly apply the sketched COP to calcu- 
late the optimal fragments for lattice proteins including side chains. Finally, the 
presented CP-based move set formulation can be easily extended to any other 
local move definition of interest. 
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